Introduction

Welcome to my GitHub Pages portfolio.

This portfolio was created as part of the Data Science Workflows course, which is part of the Data Science for Biology minor. It showcases a variety of skills I’ve developed, including the creation of an R package and examples of my typical work in RStudio.

In addition to my R-based work, I’ve also completed a project using QIIME2 and Snakemake to analyze microbiome sequencing data, which helped me explore the integration of data science with bioinformatics tools. Each project in this portfolio demonstrates the techniques and workflows I’ve applied, highlighting my growth in data science and my interest in combining biological research with computational analysis.

Curriculum Vitae - Yusra Kunduzi

Figure 1: Curriculum Vitae.

Gut Microbiome Analysis in Crohn’s Disease Patients Using Snakemake and QIIME2

The human gut microbiome plays an important role in maintaining overall health. It helps with digestion, nutrient absorption, and keeping the immune system balanced. When this microbial balance is disturbed, it can contribute to long-term health problems. One example is Crohn’s disease, a condition in which both the immune system and the gut microbiome appear to be disrupted.

To better understand the potential microbial shifts associated with Crohn’s disease, publicly available 16S rRNA sequencing data from the GSE162844 dataset was analyzed. This dataset includes ileal mucosal biopsy samples from individuals diagnosed with Crohn’s disease as well as healthy controls. The goal of this project was to explore gut microbiome composition and diversity in these two groups using a reproducible workflow built with QIIME2 and Snakemake.

Snakemake and QIIME2 workflow

Although Snakemake is built on Python, it is flexible enough to drive tools beyond Python code: it can run external programs such as QIIME2 through shell commands. QIIME2 is a widely used microbiome analysis platform for processing raw 16S rRNA gene sequencing data, especially for studying microbial diversity. In a Snakemake workflow, each QIIME2 step, together with its input and output files, can be defined as a separate rule. This setup enables automated, efficient processing of sequencing data, from initial raw data handling to final analysis and identification of microbial species.

Set up and environment

This analysis was performed in VS Code using an Ubuntu terminal running through WSL. QIIME2 was installed in a dedicated conda environment, and a compatible version of Snakemake was installed in the same environment. At the time of writing, the latest QIIME2 release was 2025.4, which works with Snakemake 7.32.4 running on Python 3.10.

QIIME 2 was installed following the official QIIME2 - 2025.4 installation guide.

For the QIIME 2 analysis, a SILVA classifier is required. This is a pre-trained model used to assign 16S rRNA sequences to bacterial taxonomies. The dataset from GSE162844 focuses on the gut microbiome, specifically the V3–V4 regions. For this analysis, the silva-138-99-nb-classifier.qza was used, as it is compatible with the current QIIME 2 version and suitable for this region of the 16S gene.

The Snakemake pipeline

To run the Snakefile, the following command was used: snakemake --cores 1

The taxonomy classification step can utilize multiple cores but is very demanding on system resources, so the number of cores was limited to 1 to ensure stable and reliable execution.

Rule all and data import

rule all:
    input:
        "taxa-bar-plots_deblur.qzv",
        "demux-paired-end.qzv",
        "demux-filter-stats.qzv", 
        "deblur-stats.qzv", 
        "table-deblur.qzv",
        "rep-seqs-deblur.qzv", 
        "aligned-rep-seqs.qza", 
        "masked-aligned-rep-seqs.qza",  
        "unrooted-tree.qza", 
        "rooted-tree.qza", 
        "core-metrics-results_deblur_250_metadata_100", 
        "alpha-rarefaction.qzv", 
        "taxonomy_deblur.qzv"    

rule import_data:
    input:
        manifest = "paired-end-manifest.csv"
    output:
        demux_paired_end = "demux-paired-end.qza"
    shell:
        """
        qiime tools import \
          --type 'SampleData[PairedEndSequencesWithQuality]' \
          --input-path {input.manifest} \
          --output-path {output.demux_paired_end} \
          --input-format PairedEndFastqManifestPhred33V2
        """

A Snakefile conventionally begins with a ‘rule all’. This rule specifies the final target files that the workflow aims to generate. When you run Snakemake, it checks which of these output files already exist and skips any steps whose outputs are already present. This makes the workflow more efficient by avoiding unnecessary computation.

Following rule all, the pipeline typically starts with the import_data step, where raw sequencing data is imported into QIIME2’s format. In this case, the input is a manifest file listing paired-end sequence files, and the output is a demultiplexed artifact (demux-paired-end.qza). “Demultiplexed” means the sequences have been separated by sample based on their barcodes, so each sequence is correctly assigned to its original sample.


Quality Control

rule demux_summarize:
    input:
        demux_paired_end = "demux-paired-end.qza"
    output:
        demux_paired_end_vis = "demux-paired-end.qzv"
    shell:
        """
        qiime demux summarize \
        --i-data {input.demux_paired_end} \
        --o-visualization {output.demux_paired_end_vis}
        """

rule quality_filter:
    input:
        demux_paired_end = "demux-paired-end.qza"
    output:
        demux_filtered_seq = "demux-filtered-seq.qza",
        demux_filtered_stats = "demux-filter-stats.qza"
    shell:
       """
       qiime quality-filter q-score \
       --i-demux {input.demux_paired_end} \
       --o-filtered-sequences {output.demux_filtered_seq} \
       --o-filter-stats {output.demux_filtered_stats}
       """

rule quality_visualize:
    input:
        demux_filtered_stats_qza = "demux-filter-stats.qza"
    output: 
        demux_filtered_stats_qzv = "demux-filter-stats.qzv"
    shell:
        """
        qiime metadata tabulate \
        --m-input-file {input.demux_filtered_stats_qza} \
        --o-visualization {output.demux_filtered_stats_qzv}
        """

These steps focus on importing the sequencing data and assessing its quality. The demultiplexed sequences are summarized to get an overview, then filtered to remove low-quality reads, and the filtering results are visualized to ensure data reliability before further analysis.


Denoising and Feature Table Processing


rule deblur_denoise:
    input:
        demux_filtered_seq = "demux-filtered-seq.qza"
    output:
        rep_seqs = "rep-seqs-deblur.qza",
        table_deblur = "table-deblur.qza",
        deblur_stats = "deblur-stats.qza"
    shell:
        """
        qiime deblur denoise-16S \
        --i-demultiplexed-seqs {input.demux_filtered_seq} \
        --p-trim-length 250  \
        --o-representative-sequences {output.rep_seqs} \
        --o-table {output.table_deblur} \
        --p-sample-stats \
        --o-stats {output.deblur_stats}
        """

rule deblur_visualize: 
    input:
        deblur_stats_qza = "deblur-stats.qza"
    output:
        deblur_stats_qzv = "deblur-stats.qzv"
    shell:
        """
        qiime deblur visualize-stats \
        --i-deblur-stats {input.deblur_stats_qza} \
        --o-visualization {output.deblur_stats_qzv}
        """

rule feature_table:
    input:
        deblur_qza = "table-deblur.qza",
        metadata = "metadata.txt"
    output:
        deblur_qzv = "table-deblur.qzv"
    shell:
        """
        qiime feature-table summarize \
        --i-table {input.deblur_qza} \
        --o-visualization {output.deblur_qzv} \
        --m-sample-metadata-file {input.metadata}
        """

rule filter_table:
    input:
        table = "table-deblur.qza"
    output:
        filtered_table = "table-deblur_filtered.qza"
    shell:
        """
        qiime feature-table filter-features \
          --i-table {input.table} \
          --p-min-frequency 10 \
          --o-filtered-table {output.filtered_table}
        """

rule filter_rep_seqs:
    input:
        rep_seqs = "rep-seqs-deblur.qza",
        table = "table-deblur_filtered.qza"
    output:
        rep_seqs_filtered = "rep-seqs-deblur_filtered.qza"
    shell:
        """
        qiime feature-table filter-seqs \
          --i-data {input.rep_seqs} \
          --i-table {input.table} \
          --o-filtered-data {output.rep_seqs_filtered}
        """

rule feature_table_tabulate:
    input:
        rep_seqs_qza = "rep-seqs-deblur.qza"
    output:
        rep_seqs_qzv = "rep-seqs-deblur.qzv"
    shell:
        """
        qiime feature-table tabulate-seqs \
        --i-data {input.rep_seqs_qza} \
        --o-visualization {output.rep_seqs_qzv}
        """

This section cleans the sequencing data by removing noise and errors, producing high-quality representative sequences and feature tables. The resulting data is then filtered to exclude rare features, and visualizations are created to explore the sequence features.


Phylogenetic Analysis and Diversity Metrics


rule phylogenetic_diversity: 
    input:
        rep_seqs = "rep-seqs-deblur.qza"
    output:
        aligned_reps = "aligned-rep-seqs.qza",
        masked_aligned_reps = "masked-aligned-rep-seqs.qza",
        unrooted_tree = "unrooted-tree.qza",
        rooted_tree = "rooted-tree.qza"
    shell:
        """
        qiime phylogeny align-to-tree-mafft-fasttree \
        --i-sequences {input.rep_seqs} \
        --o-alignment {output.aligned_reps} \
        --o-masked-alignment {output.masked_aligned_reps} \
        --o-tree {output.unrooted_tree} \
        --o-rooted-tree {output.rooted_tree} \
        --p-n-threads 16 
        """

rule diversity_analysis:
    input:
        rooted_tree = "rooted-tree.qza",
        table_deblur = "table-deblur.qza",
        metadata = "metadata.txt"
    output:
        out_dir = directory("core-metrics-results_deblur_250_metadata_100")
    shell:
        """
        qiime diversity core-metrics-phylogenetic \
          --i-phylogeny {input.rooted_tree} \
          --i-table {input.table_deblur} \
          --p-sampling-depth 100 \
          --m-metadata-file {input.metadata} \
          --output-dir {output.out_dir}
        """

rule alpha_rarefaction:
    input:
        table_deblur = "table-deblur.qza",
        rooted_tree = "rooted-tree.qza",
        metadata = "metadata.txt"
    output:
        alpha_rarefaction = "alpha-rarefaction.qzv"
    shell:
        """
        qiime diversity alpha-rarefaction \
        --i-table {input.table_deblur} \
        --i-phylogeny {input.rooted_tree} \
        --p-max-depth 7500 \
        --m-metadata-file {input.metadata} \
        --o-visualization {output.alpha_rarefaction}
        """
        

Here, sequences are aligned and used to build phylogenetic trees. Diversity analyses are performed to examine the microbial community composition, including rarefaction analysis to check if the sequencing depth is sufficient for robust results.


Taxonomy Assignment and Visualization


rule taxonomy_classifier:
    input:
        silva_classifier = "silva-138-99-nb-classifier.qza",
        rep_seqs = "rep-seqs-deblur_filtered.qza"
    output:
        taxonomy_deblur = "taxonomy_deblur.qza"
    shell:
        """
        qiime feature-classifier classify-sklearn \
        --i-classifier {input.silva_classifier} \
        --i-reads {input.rep_seqs} \
        --o-classification {output.taxonomy_deblur} \
        --p-n-jobs 1
        """

rule taxonomy_meta_tabulate:
    input:
        taxonomy_deblur = "taxonomy_deblur.qza"
    output:
        taxonomy_deblur_qzv = "taxonomy_deblur.qzv"
    shell:
        """
        qiime metadata tabulate \
        --m-input-file {input.taxonomy_deblur} \
        --o-visualization {output.taxonomy_deblur_qzv}
        """

rule taxonomy_barplot:
    input:
        table_deblur = "table-deblur_filtered.qza",
        taxonomy_deblur = "taxonomy_deblur.qza",
        metadata = "metadata.txt"
    output:
        tax_barplot = "taxa-bar-plots_deblur.qzv"
    shell:
        """
        qiime taxa barplot \
        --i-table {input.table_deblur} \
        --i-taxonomy {input.taxonomy_deblur} \
        --m-metadata-file {input.metadata} \
        --o-visualization {output.tax_barplot}
        """

In this final part, taxonomy is assigned to the cleaned sequences using a trained classifier. The taxonomic assignments are then visualized and summarized with barplots, providing insight into the microbial composition of the samples.

QIIME2 view

After running the Snakemake workflow, key .qzv visualization files were generated and viewed using QIIME 2 View. These included alpha and beta diversity plots, which help evaluate within-sample diversity and differences between groups. Because the focus here is on comparing Crohn’s disease samples to healthy controls, beta diversity metrics are the most informative.

The Snakemake rule ‘diversity_analysis’ produces a folder containing various diversity plots. For this dataset, the most relevant visualizations are: jaccard_emperor.qzv, unweighted_unifrac_emperor.qzv, and weighted_unifrac_emperor.qzv. Each plot gives insight into a different aspect of how the microbiome differs between the groups. The Jaccard plot reflects whether the same species are present or absent between groups. The Unweighted UniFrac plot builds on this by also considering the evolutionary relationships between the species. The Weighted UniFrac plot adds information about species abundance, showing how the relative amounts of bacteria differ between the groups.
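The presence/absence logic behind the Jaccard metric can be sketched in a few lines of Python (the taxon sets below are hypothetical examples, not data from this study):

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets of observed taxa:
    1 minus (shared taxa / total distinct taxa)."""
    a, b = set(a), set(b)
    shared = len(a & b)
    total = len(a | b)
    return 1 - shared / total

# Hypothetical taxa observed in two samples
inflamed = {"Bacteroides", "Escherichia", "Fusobacterium"}
control = {"Bacteroides", "Faecalibacterium", "Roseburia", "Escherichia"}

print(jaccard_distance(inflamed, control))  # 2 shared of 5 distinct taxa -> 0.6
```

A distance near 0 means the two samples contain largely the same taxa; a distance near 1 means they share almost none. The UniFrac metrics extend this idea by weighting shared branches of the phylogenetic tree (unweighted) and, additionally, taxon abundances (weighted).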

Figure 2: Jaccard PCoA plot

The Jaccard PCoA plot shows a clear separation between inflamed (red) and non-inflamed (blue) samples, which means that the types of bacteria present in each group are different. Since the Jaccard distance looks only at whether bacteria are present or not, this suggests that inflammation is linked to certain bacteria appearing or disappearing. The small amount of overlap between the groups suggests that inflammation is associated with a clear change in which bacteria are found.

Figure 3: Unweighted UniFrac PCoA plot

The Unweighted UniFrac plot shows some separation but also overlap between groups. Like Jaccard, it is based on which bacteria are present, but it also takes into account how closely related those bacteria are. The overlap here means that while some species differ between groups, many of them are still closely related.

Figure 4: Weighted UniFrac PCoA plot

The Weighted UniFrac plot shows the most overlap between inflamed and non-inflamed samples. This means that even though the two groups may differ in which species are present (as shown by the Jaccard plot), they still share many of the same dominant bacteria, just in different amounts.

Figure 5: Taxonomy bar plot

The taxa bar plot shows the relative abundance of bacterial genera across all samples, grouped by inflammation status. While the plot gives a useful overview of the microbial composition and allows for the identification of present taxa, there is no clear pattern that visually distinguishes inflamed from non-inflamed samples. The microbial profiles appear highly variable between individuals, with substantial overlap between the two groups. As a result, this plot is helpful for exploring taxonomic diversity, but it does not reveal strong group-specific differences on its own. For detecting meaningful shifts between conditions, beta diversity metrics provide more informative results.

Beta diversity conclusion

The beta diversity analyses of the GSE162844 dataset suggest that there are some microbiome differences between Crohn’s disease patients and healthy individuals, but these differences are not strongly pronounced. The Jaccard metric indicated moderate changes in species presence or absence between groups, while the Unweighted UniFrac suggested that most of these changes involve closely related taxa. The Weighted UniFrac showed substantial overlap, implying that the core microbial community and relative abundances remain largely shared.

Together, these findings indicate that Crohn’s disease may be associated with subtle shifts in microbiome composition, rather than a complete restructuring of the gut microbiota. Detecting stronger signals may require more targeted sampling, deeper sequencing, or integration with clinical and functional data.

Snakemake-QIIME2 workflow conclusion

This project applied a reproducible Snakemake-QIIME2 workflow to explore microbial differences between Crohn’s disease patients and healthy controls using 16S rRNA data. While some variation in microbial composition was observed—particularly in species presence or absence—the overall differences between groups were limited and not strongly defined.

Still, the workflow successfully processed and visualized the data, showing its value for microbiome studies. It can easily be reused with different datasets by updating the input files and metadata, making it a flexible and scalable tool for future analyses.

Looking ahead

Where do I want to be in ~2 years’ time?

I’m currently in the final phase of my Life Sciences degree at Hogeschool Utrecht and will be graduating soon. In about two years, I hope to be working in a research-oriented lab environment, ideally in the field of microbiology. I want to be in a position where I can combine hands-on lab work with data analysis, particularly involving next-generation sequencing (NGS), so I can continue building on both my experimental and data science skills.

By that time, I aim to have gained valuable experience in the field and a better understanding of what type of work suits me best. If I come across a master’s program that really fits my interests, I might decide to continue studying, but for now, my goal is to focus on growing professionally and learning as much as possible in a real research setting.


How am I doing now with respect to this goal?

I’m well on my way to achieving my goals. Looking back, I sometimes don’t realize how much I’ve learned over the past few years, but I often surprise myself with what I’m capable of. I’m currently doing my research internship, the final phase of my studies before graduation, where I’m working with NGS data and focusing on further developing my data science and bioinformatics skills.


What would be the next skill to learn?

I would like to gain more hands-on experience with next-generation sequencing (NGS) in the lab. So far, I’ve mainly worked with NGS data during the analysis phase, but I’m really interested in learning how to carry out the full process from sample preparation to library construction and running the sequencer. In addition to that, I’d like to strengthen other general lab skills to become more confident in an experimental setting. My goal is to better connect what happens in the lab to the data I analyze, so I can understand the full workflow from sample to result and make more informed decisions during analysis.

Guerrilla Analytics on DAUR2 course

Guerrilla analytics is an approach to data management and analysis that focuses on efficiency, simplicity, and speed. It prioritizes creating practical workflows and making fast, effective choices with the available resources.

For my DAUR2 RNA sequencing and metagenomics assignments, I applied guerrilla analytics by reorganizing my project structure to optimize data management. I focused on keeping essential files well-organized, removed unnecessary large datasets, and provided clear documentation in readme.txt files. I visualized the folder structure using the {fs} package in R, as shown in the figure below, to make the project more organized and easier to navigate.

Figure 6: Directory tree of the DAUR2 course.

C. elegans plate experiment

In this experiment, C. elegans nematodes were exposed to varying concentrations of different compounds. The dataset, provided by J. Louter, includes key variables such as the number of offspring (RawData), compound names (compName), concentrations (compConcentration), and experimental types (expType). The aim of the analysis was to create visualizations using ggplot2 and outline the steps for performing a dose-response analysis to assess the relationship between compound concentrations and offspring count.

#load the packages used below (assumed to be installed)
library(dplyr)
library(ggplot2)

#change compName and expType into factors
flow_tidy$compName <- as.factor(flow_tidy$compName)
flow_tidy$expType <- as.factor(flow_tidy$expType)

#change compConcentration into numeric column
flow_tidy$compConcentration <- as.numeric(flow_tidy$compConcentration)

#remove rows with NA in RawData
flow_tidy <- flow_tidy %>%
  filter(!is.na(RawData))

#mean of controlNegative for normalization of data
control_negative_mean <- mean(flow_tidy$RawData[flow_tidy$expType == "controlNegative"], na.rm = TRUE)

#new column with normalized data (normalized to the controlNegative mean)
flow_tidy <- flow_tidy %>%
  mutate(NormalizedData = RawData / control_negative_mean)

#take the log10 value of compConcentration
flow_tidy$logConcentration <- log10(flow_tidy$compConcentration)

#plot the data with jitter to avoid overlapping points and color by compName
flow_tidy_plot <- flow_tidy %>%
  ggplot(aes(x = logConcentration, y = NormalizedData, 
             colour = compName, shape = expType)) +
  geom_jitter(width = 0.1, height = 0.1) + theme_bw() +
  labs(
    title = "C.elegans Nematodes Exposed to Varying Concentrations of Different Compounds", 
    x = "Log10 of the Concentration in nM",
    y = "Normalized Offspring Count"
  )

print(flow_tidy_plot)

Figure 7: Scatterplot showing the effect of different compounds at various concentrations on C. elegans offspring

The concentrations of the compounds span a wide range, which isn’t ideal for visualization in a scatterplot. To solve this, the log10 of compConcentration is used, reducing the distance between data points. Another issue is that compConcentration was initially of class ‘character’, which disrupts the order in which the data is displayed.

RawData is of class ‘numeric’, which is correct. compName and expType were both initially of class ‘character’ and have been changed to ‘factor’. compConcentration was of class ‘character’, but changed to ‘numeric’.

The positive control for this experiment is Ethanol. The negative control for this experiment is S-medium.

Normalizing the data gives a baseline for comparison, reducing variability. This makes it easier to spot meaningful differences by ensuring that any changes are due to the experimental conditions, not just natural differences in the data.
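The normalization idea can be illustrated outside R as well; below is a minimal Python sketch with hypothetical counts (the actual analysis uses the R/dplyr code shown above):

```python
# Hypothetical raw offspring counts; controlNegative defines the baseline
raw = {"controlNegative": [10, 12, 11], "compoundA": [5, 6]}

# The mean of the negative control becomes 1 after normalization
control_mean = sum(raw["controlNegative"]) / len(raw["controlNegative"])

# Express every observation as a fraction of the negative-control mean
normalized = {k: [x / control_mean for x in v] for k, v in raw.items()}
print(normalized["compoundA"])
```

Because every value is divided by the same baseline, the normalized negative-control values average exactly 1, and treatment effects can be read directly as fractions of that baseline.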

Analyzing the data

If this experiment is analyzed to determine if there is an effect of different concentrations on offspring count and whether the compounds have varying effects, the following steps should be followed:

1: Preparation of the Data: First, the data should be imported from the Excel file into RStudio to prepare it for analysis. During this process, columns should be assigned the correct data types, NA values should be removed, and the data should be normalized. Specifically, normalization should ensure that the mean value of the negative control is equal to 1, with all other values expressed as fractions of the negative control. This step is essential for comparing the effects of different compounds consistently.

2: Visualization of the Data with a Scatterplot: This step is important for identifying potential issues with the data and making any necessary adjustments. It allows you to visualize the relationship between compound concentrations and offspring count. Use the compound concentration on the X-axis and the offspring count (RawData) on the Y-axis, with different colors representing each compound and different shapes representing the experimental types. Jitter can be applied to prevent overlapping data points.

3: Dose-Response Curve (DRC): To better analyze the effects of the compounds on C. elegans and assess the relationship between dose and response, a DRC is necessary. This involves fitting a dose-response model using the {drc} package in R. Use the log-transformed compound concentration on the X-axis and the offspring count (RawData) on the Y-axis. The DRC will help quantify key parameters such as the IC50 value, as well as the minimal and maximal response levels.

4: Analysis/Conclusion of the DRC: The analysis should focus on determining the IC50 value for each compound, identifying the minimal and maximal concentrations affecting C. elegans, and comparing the dose-response curves to evaluate differences in compound effects. Controls should be used to validate the experimental results.
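Step 3 relies on a dose-response model; the LL.4 model fit by the {drc} package is a four-parameter log-logistic curve. The Python sketch below only illustrates what the IC50 parameter means (the real analysis would fit the model with drm() in R):

```python
def ll4(dose, lower, upper, ic50, hill):
    """Four-parameter log-logistic dose-response curve:
    response falls from `upper` to `lower` around `ic50`,
    with steepness controlled by `hill`."""
    return lower + (upper - lower) / (1 + (dose / ic50) ** hill)

# At the IC50, the response is exactly halfway between upper and lower
print(ll4(dose=50, lower=0.0, upper=1.0, ic50=50, hill=2))  # 0.5
```

Fitting this curve to the normalized offspring counts yields the IC50 and the minimal and maximal response levels mentioned in step 4, which can then be compared between compounds.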

Review and Reproducibility

Open Peer Review

SARS-CoV-2 variants reveal features critical for replication in primary human cells

Since the emergence of SARS-CoV-2 and the millions of people it has infected, the amount of research on the virus has increased rapidly. Part of this research focuses on genome sequencing to identify variants, but there has been limited work on how the resulting mutations affect processes such as virus replication or transmission. This article presents research on 14 different SARS-CoV-2 variants that emerged during the early stages of the pandemic in Europe. The goal of this research was to gain a better understanding of these variants. To achieve this, the variants were isolated from anonymized patient samples collected in Switzerland between March and May 2020. These samples were cultured in Vero-CCL81 cells, and primary bronchial epithelial cells (BEpCs) from three different human donors were also used to grow the variants. The virus was cultured through multiple passages in Vero-CCL81 cells to generate viral stocks.

The viral growth, genetic variants, and specific mutations, such as those in the Spike protein and other genes, were analyzed using sequencing and phenotypic assays. The full genome sequencing was performed using next-generation sequencing (NGS) methods. The results showed that certain mutations, such as the D614G substitution in the Spike protein, were associated with enhanced replication in human cells. Additionally, mutations that occurred during passage in Vero cells, such as deletions in the furin cleavage site, strongly affected replication in BEpCs. This highlights the importance of carefully checking viral stocks when studying new variants.

Statistical significance was determined using one-way ANOVA and unpaired t-tests on log2-transformed data. The research was conducted in a BSL3 facility, with all procedures, including risk assessments and protective measures, approved by the Swiss Federal Office of Public Health. All relevant data are available through GISAID (accession IDs EPI_ISL_590823 to EPI_ISL_590836). Although the article does not provide specific information on the availability of the code used, it offers extensive methodological details and transparency in the results.

(Pohl et al., 2021)

Peer Review Transparency Criteria Table

| Transparency Criteria | Definition | Response |
|---|---|---|
| Study Purpose | A concise statement in the introduction of the article, often in the last paragraph, that establishes the reason the research was conducted. Also called the study objective. | Yes |
| Data Availability Statement | A statement, in an individual section offset from the main body of text, that explains how or if one can access a study’s data. The title of the section may vary, but it must explicitly mention data; it is therefore distinct from a supplementary materials section. | Yes |
| Data Location | Where the article’s data can be accessed, either raw or processed. | All SARS-CoV-2 isolate sequences are available from GISAID (accession IDs EPI_ISL_590823 to EPI_ISL_590836) |
| Study Location | Author has stated in the methods section where the study took place or the data’s country/region of origin. | Yes; Switzerland (Zurich) |
| Author Review | The professionalism of the contact information that the author has provided in the manuscript. | E-mail addresses for multiple authors are provided |
| Ethics Statement | A statement within the manuscript indicating any ethical concerns, including the presence of sensitive data. | No |
| Funding Statement | A statement within the manuscript indicating whether or not the authors received funding for their research. | Yes |
| Code Availability | Authors have shared access to the most updated code that they used in their study, including code used for analysis. | No |

Reproducibility assessment

Reproducibility in research is necessary as it allows other researchers to use the data and code for their own work. Unfortunately, the data or code isn’t always accessible, sometimes hindering progress. Even when both are available, it’s important to assess whether the code is truly reproducible. This is why I reviewed a paper titled “Effect of Targeted Behavioral Science Messages on COVID-19 Vaccination Registration Among Employees of a Large Health System” on PubMed.

The code and data for this study were available on OSF, with one file for the scripts and three files for the data. After reviewing the code, I would rate its readability a 3/5. While the code is not bad, it could be clearer. Some variable names were understandable, but as the number of variables increased, they became less clear. The code could also benefit from more consistent comments.

The reproducibility of the code is a 5/5. The three lines for importing the data, with placeholders for file paths like “INSERT PATH HERE”, were clear. After inserting my own file paths, the code ran without issues, generating a bar chart identical to the one in the paper. The code first loads the datasets, then summarizes key variables like registration numbers, email opens, clicks, and deliveries. After preparing and cleaning the data, it generates three bar charts for visualization.

library(dplyr)
library(psych)
library(gmodels)
library(zoo) ## for working with dates
library(chron) ##for working with dates
library(ggplot2)
library(effects) ## for CI
library(DT) ## for tables
library(RColorBrewer) ## for survey plot

VAC.SURVEY.PATH <- "C:/Users/yusra/OneDrive/Documenten/dsfb2_workflows_portfolio/ruwe_data/COVID_Vaccine_Employee_Survey_20210202_OSF.csv"


VAC.SURVEY <- read.csv(gsub("[\r\n]", "", VAC.SURVEY.PATH), header = TRUE,
                          na.strings = c("", "N/A", "NA"))

total <- nrow(VAC.SURVEY)
freq_table <- data.frame(table(VAC.SURVEY$main))
names(freq_table) <- c("Codes", "Freq")

tmp <- freq_table %>% arrange(Freq)
freq_table$Codes <- factor(freq_table$Codes, levels = tmp$Codes)
freq_table$Perc <- paste(round((freq_table$Freq/total)*100,1), "%", sep = "")

ggplot(freq_table, aes(x = Codes, y = Freq, fill = Codes)) +
    geom_bar(stat = 'identity', position = position_dodge(), color = 'black') + 
    theme_classic() +
    theme(axis.text.x = element_text(vjust = 1, hjust = 1),
          axis.text = element_text(size = 7, color = 'black'),
          strip.text = element_text(size = 7),
          axis.title = element_text(size = 7),
          axis.title.y = element_text(margin = unit(c(5, 0, 0, 0), "mm")),
          legend.position = 'none') + xlab("") + ylab("\nNumber of respondents") +
    geom_text(aes(label=Perc), size=3, position=position_dodge(width=0.9), 
              hjust=-0.2, color='black') +
    scale_fill_manual(values = c('white', 'white', 'white', 'white', 
                                 'white', 'white', 'white', 'white', 
                                 'white', 'white', 'white',
                                 "#1B7837", "#5AAE61", "#A6DBA0", 
                                 "#D9F0D3", "#F7F7F7", "#E7D4E8", 
                                 "#C2A5CF", "#9970AB", "#762A83")) + 
    coord_flip() +
    ylim(c(0, max(freq_table$Freq)*1.3))

Figure 8: Reported reasons for COVID-19 vaccine hesitancy. The figure shows the most common reasons respondents gave for not wanting to get the COVID-19 vaccine. The main concern was unknown risks (36.2%), followed by pregnancy-related concerns (14.6%) and other reasons such as previous infection, timing, and concerns about ingredients or side effects. Less common responses included religious reasons, privacy concerns, and low perceived risk.

New R package

text.analyzer Package

The text.analyzer package simplifies text analysis in R. It provides a set of functions to quickly analyze text data, such as word and sentence counts, identifying the longest word, and extracting the most frequent words. The package is designed to be efficient, allowing users to perform basic text analysis tasks without the need for complex setups.

Development of the text.analyzer Package

I started developing the text.analyzer package by creating a new R package project with devtools::create_package(). This automatically set up the file structure and made the project ready to use in RStudio.

The first function I added was text_summary(), which summarizes text data. I documented it using roxygen2 and ran devtools::check() to fix any issues. After that, I added more functions, like longest_word(), top10_words(), and sentence_summary(), following the same process: writing the function, documenting it, testing it, and fixing any problems.

Once the package was complete, I installed it locally with devtools::install() to make sure everything worked as expected. I also created a vignette in RMarkdown, placed it in the vignettes/ folder, and built it using devtools::build_vignettes(). This made the vignette accessible through browseVignettes().

Finally, I uploaded the package to GitHub, set up continuous integration, and added a README.md file with installation instructions and examples to help users get started.
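
The development workflow described above can be summarized as a short sequence of devtools calls. This is a sketch of the steps, not a script meant to be run top to bottom in one session; the package name matches the one described here:

```r
# Create the package skeleton (run once, from the parent directory)
devtools::create_package("text.analyzer")

# After writing a function in R/ and documenting it with roxygen2 comments:
devtools::document()   # generate man/ pages and the NAMESPACE file
devtools::check()      # run R CMD check and fix any reported issues

# Once all functions pass check, install locally and build the vignette
devtools::install()
devtools::build_vignettes()
browseVignettes("text.analyzer")
```

Repeating the document/check cycle after each new function keeps the package in a passing state throughout development.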

Installing text.analyzer

# Install using devtools
devtools::install_github("YusraKunduzi/text.analyzer")

# Install using pak (preferred for dependency management)
pak::pak("YusraKunduzi/text.analyzer")

Usage

library(text.analyzer)

# Example text: the first paragraph of this page
text <- "The text.analyzer package simplifies text analysis in R. It provides a set of functions to quickly analyze text data, such as word and sentence counts, identifying the longest word, and extracting the most frequent words. The package is designed to be efficient, allowing users to perform basic text analysis tasks without the need for complex setups."

# How to use text_summary
sum <- text_summary(text)
print(sum)
## $word_count
## [1] 56
## 
## $sentence_count
## [1] 4
## 
## $average_word_length
## [1] 5.285714
## 
## attr(,"class")
## [1] "text_summary"

# How to use longest_word
longest <- longest_word(text)
print(longest)
## $word
## [1] "text.analyzer"
## 
## $length
## [1] 13
## 
## $position
## $position$line
## [1] 1
## 
## $position$index
## [1] 2

# How to use top10_words
top <- top10_words(text)
print(top)
##       words Freq
## 1      text    3
## 2        to    3
## 3  analysis    2
## 4   package    2
## 5      word    2
## 6  allowing    1
## 7   analyze    1
## 8        as    1
## 9     basic    1
## 10       be    1

# How to use sentence_summary
sentences <- sentence_summary(text)
print(sentences)
## $sentence_count
## [1] 4
## 
## $avg_sentence_length
## [1] 14.25
## 
## $longest_sentence
## [1] " It provides a set of functions to quickly analyze text data, such as word and sentence counts, identifying the longest word, and extracting the most frequent words"
## 
## $shortest_sentence
## [1] "The text"
## 
## $readability_score
## [1] 86.65696
## 
## attr(,"class")
## [1] "sentence_summary"

Project - R Shiny app

Bacteria Analysis Board for VITEK® MS

The VITEK MS is a machine commonly used in clinical microbiology for the identification of yeasts, bacteria, and their antibiotic resistance or sensitivity. This mass spectrometry system utilizes biochemical tests to distinguish between bacterial species based on their distinct responses to a series of chemical reactions (De Respinis et al. 2014). The process begins by inserting a bacterial solution along with an identification or antibiotics card into the VITEK MS system. Each card contains a set of biochemical tests specific to the bacteria being analyzed. As the bacterial solution is drawn through the card, the system measures the extinction values corresponding to each test, indicating the bacteria’s reaction to the biochemical compounds. Each species has a unique pattern of extinction values, allowing the VITEK MS to accurately identify them based on their specific response patterns.

While the VITEK MS is efficient and has a large database for identifying a wide range of organisms, it does have limitations (Okabe et al. 2000). One of the main challenges is distinguishing closely related bacterial species that exhibit similar biochemical response patterns. For example, Streptococcus mitis and Streptococcus oralis, both Gram-positive bacteria, are often difficult to differentiate using the VITEK system alone. Their biochemical reaction patterns are so similar that the system frequently produces a combined result, indicating both species are present, rather than providing a definitive identification. This issue also occurs with the MALDI-TOF MS system, which, despite its large database, struggles to differentiate between these closely related species due to their highly similar protein profiles (Han, Jeong, and Choi 2021).

This project aims to address this limitation by investigating the differences in extinction patterns between S. mitis and S. oralis using the VITEK MS data. Through detailed analysis of the extinction outcomes from the biochemical tests on the Gram-positive cards, we aim to identify even the slightest differences between these two species. By leveraging advanced data analysis techniques, we will develop an application or website where users can input their VITEK extinction data files. These files will then be processed through custom scripts that allow users to visualize and interpret the data more effectively.

The core functionality of the R Shiny app will include the ability to generate visualizations such as scatterplots, boxplots, heatmaps, and Principal Component Analysis (PCA) plots (Leong et al. 2020). These tools will help users better understand how their samples compare across different tests. The primary goal is to provide a more accurate and detailed view of the subtle differences between closely related bacteria like S. mitis and S. oralis. For instance, scatterplots will allow users to see how individual tests correlate, while heatmaps will highlight patterns of bacterial response across all tests. PCA plots will help users observe the overall distribution of their samples, making it easier to spot any clustering or outliers that may indicate different bacterial species.
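
As a minimal illustration of the PCA step, base R's prcomp() is sufficient. The extinction values, sample names, and test names below are invented for the example, not real VITEK output:

```r
# Toy extinction matrix: 6 samples x 4 biochemical tests (invented values)
set.seed(1)
extinctions <- matrix(runif(24), nrow = 6,
                      dimnames = list(paste0("sample", 1:6),
                                      paste0("test", 1:4)))

# Centre and scale each test, then run PCA
pca <- prcomp(extinctions, center = TRUE, scale. = TRUE)

# The first two principal components are what the app would plot;
# samples that cluster together have similar extinction patterns
scores <- pca$x[, 1:2]
```

In the app, each point in the resulting PC1/PC2 scatter corresponds to one sample, so clusters and outliers become visible at a glance.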

Already, some differences have been observed in the extinction patterns of certain tests between S. mitis and S. oralis. These findings have led to the development of a new feature for the Rshiny app: a Mitis/Oralis Checker. This tool will allow users to input their VITEK data and receive a more straightforward answer regarding whether they are dealing with S. mitis or S. oralis. The checker will be based on the identified patterns from the biochemical tests and will use the data from the visualizations as a reference. While the checker provides a quick and user-friendly solution, users will still have the option to view the detailed plots and analyses to understand the basis of the checker’s conclusions.
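
A hypothetical sketch of how such a checker could work: compare a sample's extinction values against a reference pattern for each species and report the closer match. The reference values, thresholds, and function name below are invented for illustration; the real checker would use the patterns identified from the VITEK data:

```r
# Hypothetical rule-based checker: classify a sample by its Euclidean
# distance to a reference extinction pattern for each species
check_mitis_oralis <- function(sample, mitis_ref, oralis_ref) {
  d_mitis  <- sqrt(sum((sample - mitis_ref)^2))
  d_oralis <- sqrt(sum((sample - oralis_ref)^2))
  if (d_mitis < d_oralis) "S. mitis" else "S. oralis"
}

# Invented reference patterns over three discriminating tests
mitis_ref  <- c(0.80, 0.20, 0.65)
oralis_ref <- c(0.55, 0.45, 0.30)

check_mitis_oralis(c(0.78, 0.25, 0.60), mitis_ref, oralis_ref)
# → "S. mitis"
```

A production version would add a confidence measure (for example, refusing to decide when the two distances are nearly equal), which is exactly where the detailed plots help the user judge borderline cases.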

For now, the app will support any bacteria tested with the Gram-positive card, ensuring that a wide range of samples can be analyzed. This tool has the potential to improve bacterial identification accuracy in clinical microbiology by providing more precise results, especially in cases where traditional systems like the VITEK MS may fall short. Ultimately, this project aims to enhance the ability to distinguish between similar bacterial species.

Because the R Shiny app is already set up for Gram-positive identification cards, adding the information and tests for a Gram-negative card would not be difficult (Wallet et al. 2005). Researching and obtaining data from other bacteria that are difficult to distinguish would be helpful in developing additional checkers, like the Mitis/Oralis checker. Although these additions would be valuable, they are outside the scope of the current project and could be explored in future projects.

COVID-19 deaths and cases parameterized

This report visualizes the number of COVID-19 cases and deaths in a selected European country for a specific month and year, based on user-defined parameters. By adjusting the country, year, and month parameters, the same report can be reused to explore trends in different regions and time periods.
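
For reference, a parameterized report of this kind is driven by a params block in the R Markdown YAML header, whose values are then available as params$country, params$year, and params$month in the code chunks. The default values shown here are only an example:

```yaml
params:
  country: "Netherlands"
  year: 2021
  month: 3
```

Rendering with different values (for instance via rmarkdown::render() with a params argument) regenerates the same report for another country or time period.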

Obtaining the data and loading the necessary libraries

The data used in the graphs is obtained from the European Centre for Disease Prevention and Control under the dataset titled Data on the daily number of new reported COVID-19 cases and deaths by EU/EEA country. Please note that the data has not been updated since November 1, 2022.

library(readr)
library(ggplot2)
library(dplyr)
library(plotly)

data <- read.csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath_eueea_daily_ei/csv", na.strings = "", fileEncoding = "UTF-8-BOM")

Preparing the data

The data is filtered by the parameters chosen by the viewer of the plots.

# Use parameters from the YAML header
filtered_data <- data %>%
  filter(countriesAndTerritories == params$country, year == params$year, month == params$month)

# Fix date format for ggplot
filtered_data$dateRep <- as.Date(filtered_data$dateRep, format = "%d/%m/%Y")

Plotting the COVID-19 death counts by parameters

# Plot the death count
deaths <- filtered_data %>% 
  ggplot(aes(x = dateRep, y = deaths)) + 
  geom_line(color = "darkred", linewidth = 1.5) + 
  theme_classic() + 
  labs(
    title = paste("COVID-19 Deaths in", params$country, "(", params$month, "/", params$year, ")"),
    x = "Date",
    y = "Number of Deaths"
  ) + # Rotate the x-axis labels for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(deaths)

Plotting the COVID-19 cases by parameters

# Plot the case count
cases <- filtered_data %>% 
  ggplot(aes(x = dateRep, y = cases)) + 
  geom_line(color = "orange", linewidth = 1.5) + 
  theme_bw() + 
  labs(
    title = paste("COVID-19 cases in", params$country, "(", params$month, "/", params$year, ")"),
    x = "Date",
    y = "Number of cases"
  ) + # Rotate the x-axis labels for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(cases)



References

De Respinis, Sophie, Valérie Monnin, Victoria Girard, Martin Welker, Maud Arsac, Béatrice Cellière, Géraldine Durand, et al. 2014. “Matrix-Assisted Laser Desorption Ionization–Time of Flight (MALDI-TOF) Mass Spectrometry Using the Vitek MS System for Rapid and Accurate Identification of Dermatophytes on Solid Cultures.” Edited by Y.-W. Tang. J Clin Microbiol 52 (12): 4286–92. https://doi.org/10.1128/JCM.02199-14.
Han, Sang-Soo, Young-Su Jeong, and Sun-Kyung Choi. 2021. “Current Scenario and Challenges in the Direct Identification of Microorganisms Using MALDI TOF MS.” Microorganisms 9 (9): 1917. https://doi.org/10.3390/microorganisms9091917.
Leong, Claudia, Jillian J Haszard, Anne-Louise M Heath, Gerald W Tannock, Blair Lawley, Sonya L Cameron, Ewa A Szymlek-Gay, et al. 2020. “Using Compositional Principal Component Analysis to Describe Children’s Gut Microbiota in Relation to Diet and Body Composition.” The American Journal of Clinical Nutrition 111 (1): 70–78. https://doi.org/10.1093/ajcn/nqz270.
Okabe, Tadashi, Kozue Oana, Yoshiyuki Kawakami, Masaru Yamaguchi, Yuko Takahashi, Yukie Okimura, Takayuki Honda, and Tsutomu Katsuyama. 2000. “Limitations of Vitek GPS-418 Cards in Exact Detection of Vancomycin-Resistant Enterococci with the vanB Genotype.” J Clin Microbiol 38 (6): 2409–11. https://doi.org/10.1128/JCM.38.6.2409-2411.2000.
Wallet, Frédéric, Caroline Loïez, Emilie Renaux, Nadine Lemaitre, and René J. Courcol. 2005. “Performances of VITEK 2 Colorimetric Cards for Identification of Gram-Positive and Gram-Negative Bacteria.” J Clin Microbiol 43 (9): 4402–6. https://doi.org/10.1128/JCM.43.9.4402-4406.2005.